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Abstract 

(SJQf The quadratic assignment problem (QAP) is one of the most difficult combinatorial optimization problems. 

An effective heuristic for obtaining approximate solutions to the QAP is simulated annealing (SA). Here 
we describe an SA implementation for the QAP which runs on a graphics processing unit (GPU). GPUs 
are composed of low cost commodity graphics chips which in combination provide a powerful platform for 
general purpose parallel computing. For SA runs with large numbers of iterations, we find performance 
fa 50 — 100 times better than that of a recent non-parallel but very efficient implementation of SA for the 
QAP. 
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1. Introduction 

Recently, relatively inexpensive commodity 
graphics chips produced in large volumes to pro- 
vide basic graphics support on personal computers 
have been combined into powerful graphics process- 
ing units (GPUs) which can be used for general pur- 
pose parallel processing Here we describe the 
implementation of the simulated annealing heuris- 
tic on such a GPU. 

Originally formulated by Koopmans and Beck- 
mann [lij ]. the QAP is NP-hard and is considered 
to be one of the most difficult problems to be solved 
optimally. It was defined in the following context: 
A set of N facilities are to be located at N locations. 
The quantity of materials which flow between facil- 
ities i and j is Aij and the distance between loca- 
tions i and j is By. The problem is to assign to 
each location a single facility so as to minimize the 
cost 



N N 



p{i),pU) ' 



(1) 



= 1 3=1 



where p(i) represents the location to which i is as- 
signed. 



There is an extensive literature that addresses the 
QAP and which is reviewed in Refs. [Il.l6l.llol. ll3l . fl5l 
|j| . The QAP is exceedingly hard to solve optimally. 
With the exception of specially constructed cases, 
optimal algorithms have solved only relatively small 
instances with N < 36. Therefore, various heuris- 
tic approaches have been developed and applied to 
problems typically of size N ss 100 or less. In con- 
trast, a travelling salesman problem consisting of 
almost 25,000 towns in Sweden has been solved ex- 
actly 0| . 

2. Background 

2.1. Simulated Annealing 

The simulated annealing heuristic was first ap- 
plied to the QAP by Burkhard and Rendl [5j and 
was refined by Connolly 0]. The heuristic consists 
of swapping locations of two facilities. Proposed 
swaps can either be determined randomly or se- 
lected according to some sequential enumeration of 
all possible swaps. For each proposed swap, the 
change in cost, 8, for the potential swap is calcu- 
lated. The swap is made if 
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where T is an analog of temperature in physical sys- 
tems that is slowly decreased according to a spec- 
ified cooling schedule after each iteration and r is 
a uniformly distributed random variable between 
and 1 . Randomly making moves which increase the 
cost is done to help escape from local minima. 

In traditional implementations of the heuristic, 
the cost of making a swap is calculated from scratch 
when the swap is considered in order to determine 
if the swap should be made. 

2.2. Recent efficient non-parallel implementation 

Recently an implementation of SA for the QAP 
was developed which improves performance over 
the traditional implementation by factors of up to 
100 for large numbers of iterations on large prob- 



lem instances 17|. The approach taken is to at- 



tempt to reduce the complexity by maintaining a 
matrix A of costs of each possible swap. Thi s ap - 
proach is motivated by the work of Taillard 18]), 
who applied a similar approach in the application 
of another heuristic, tabu search, to the QAP. The 
A-matrix approach is as follows: 

(a) Initialize by creating a matrix Ajj containing 
the cost of swapping i and j for all i and j, 
given a current assignment p. Set iteration 
number to zero. 

(b) Increment the iteration number. Retrieve the 
cost A riS of the next possible swap (r, s). 

(c) If the cost of the possible swap does not meet 
the criteria in Eq. @, go to (b). 

(d) If the proposed swap meets the swap criteria, 
update p to reflect the swap, update Ajj with 
the new swap costs given the new permutation, 
and go to (b). 

(e) End after I iterations. 

The number of iterations in which a swap is per- 
formed divided by the total number of iterations is 
known as the acceptance rate a(I). The lower the 
acceptance range, the more efficient the A-matrix 
approach. 

In addition to providing higher performance, 
even in a non-parallel environment, than the tra- 
ditional SA implementation, the A-matrix imple- 
mentation lends itself to GPU processing because, 
as explained in Section [3J the costs of all swaps are 
calculated together at the same time. 



2.3. Need for parallelization 

Ideally if many processors are available to run a 
heuristic, the most efficient use of these processors 
is to ran copies of the heuristic independently on 
each processor. However, as shown in [l7| . statis- 
tically, the larger the number of SA iterations, the 
higher the quality of the solution found. Thus if 
high quality solutions are desired, an SA run with 
a high number of iterations can not be replaced 
by many independent SA runs each with a smaller 
number of iterations; a parallel implementation is 
therefore required to allow timely completion of SA 
runs with a high number of iterations. 

2.4- Previous work on parallel heuristics for the 
QAP 

James et al. [ll[ propose a parallel tabu search 
algorithm for the QAP and review previously pro- 
posed parallel heuristics for the QAP most of which 
involve the tabu search heuristic. Recently it was 
shown that if high quality solutions are desired, sim- 
ulated annealing performs better than tabu search 
[la ]. We are not aware of any implementations of 
the simulated annealing heuristic for the QAP on 
a GPU. Simulated annealing presents a complexity 
not present in tabu search. Instead of calculating 
all possible swaps and then choosing one, a swap is 
made when the first swap meeting the requirement 
of Eq. © is found. 

Crainic and Toulous [§] classify parallel heuristic 
methods into three categories. Our implementation 
falls into their Type 1 category — the paralleliza- 
tion of computationally expensive low level opera- 
tions. 

2. 5. Characteristics of graphics processing unit 

We have implemented the SA heuristics on the 
Nvidia C2070 GPU. The important features of the 
NVIDIA GPU architecture are Q: 

• A graphics processing unit contains one or 
more symmetric multiprocessors(SMPs). Each 
SMP contains multiple processing elements 
each of which executes a sequential thread. 
Threads are organized into warps of 32 threads 
each. One or more warps are further grouped 
into a block. 

• Threads within a block can be synchronized 
with a barrier synchronization GPU primitive. 
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• Code which runs on the GPU is contained in 
a kernel. There is a relatively large overhead, 
on the order of microseconds, associated with 
launching a kernel. 

• Each SMP contains a small amount of shared 
memory which all of the processing elements 
can access with relatively low latency. For the 
GPU on which we implemented our program 
shared memory is 48 KB. 

• All SMPs can access a large device memory but 
with high latency on the order of 100 times 
longer than access to shared memory; for the 
GPU on which we implemented our program 
device memory is 1.3 TB. Data must be trans- 
ferred from the host CPU memory to device 
memory before it can be processed by an SMP. 

• When a warp executes an instruction that ac- 
cesses device memory, it coalesces the mem- 
ory accesses within the warp into one or more 
memory transactions for fixed length aligned 
segments(128 bytes for the C2070) in device 
memory. The greater the coalescence (the 
fewer the memory transactions) the better the 
performance. 

3. Approach 

In the A-matrix approach, on which we base our 
GPU implementation, the bulk of the processing 
time is spent (i) updating the matrix A and (ii) 
passing through the matrix elements to identify a 
swap which meets the criteria of Eq. ©. On the 
GPU we implement these steps in the algorithm as 
follows: 

(i) We update A using multiple threads in mul- 
tiple blocks. Each thread calculates elements Ajj 
with a given i and a range of j. If the range of 
j is too small, the setup overhead to process these 
elements will be relatively large. For this reason, 
we limit the number of threads that are used to 
evaluate A with the constraint that each thread 
processes at least 16 matrix elements. 

(ii) The search for the next potential swap which 
satisfies Eq. @ is also performed with multiple 
threads. Starting at the next possible swap after 
the last swap performed, multiple threads evalu- 
ate a subset of all elements of the A-matrix. The 
size of the subset is an adjustable parameter. If 
one or more valid swaps are found in the subset, 
the swap which would have been found first if the 



swaps were evaluated sequentially is chosen. If no 
valid swap is found, the next subset is treated. The 
process is repeated until a valid move is found. If 
the subset size is too small, before a valid swap is 
found many subsets must be processed with some 
attendant overhead. If the subset is too large, more 
swaps than need be are evaluated since many valid 
swaps will be found. Wc found good results when 
the subset size is determined by the requirement 
that each thread processes 16 matrix elements. 

In order to implement the algorithm efficiently on 
the Nvidia GPU the following must be taken into 
account: 

(i) With the exception of small problem instances 
(N < 100. The matrices A, B and A are too 
large to be stored in shared memory and must 
be stored in device memory. As noted above 
in Sec. 12.51 efficient access to device mem- 
ory is achieved when data in contiguous blocks 
of 128 bytes is accessed simultaneously by all 
threads in a warp. In updating the A-matrix, 
terms of the form Ai t jB p ^ p ^ must be evalu- 
ated. Because it is not possible to control the 
permutation p, it is not possible to meet the 
requirement for efficient access when updating 
the A-matrix without a modification in the ap- 
proach. For this reason, we maintain a matrix 

the elements of which reflect swaps which 
have been performed. In conjunction with a 
swap of facilities r and s, after updating p we 
update the matrix B' 

B i,j *~ B ' P (i),pU) ( 3 ) 

for columns and rows of B' with index r or 
s. This is equivalent to swapping rows r and 
s and swapping columns r and s. The oper- 
ation is of complexity O(N) and is performed 
efficiently using N threads making an insignif- 
icant contribution to the processing time. Us- 
ing B' instead of B in the updating of A, the 
terms to be evaluated are of the form AijBlj 
and the requirement for efficient data access 
can be met. The insight that the calculation 
can be transformed in this way is one of the 
keys to good performance. 

(ii) Because all elements of A are updated at 
the same time, we are able to use the lim- 
ited amount of shared memory accessible by 
the SMPs to improve performance of updat- 
ing A as follows. After a swap of facilities 
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r and s, but before updating A, we load 
A r ,k, ^4 s ,fe, B' r k , and B' s k for all k into shared 
memory requiring only O(N) bytes of shared 
memory. No other elements of A and B' are 
are required for the update of elements Aij for 
i and j not equal to r or s; other elements of 
A do require some A and B' device memory 
access. 

(iii) There are points in the processing (e.g. af- 
ter the identification of the next swap, after 
the updating of B' , and after the updating 
of A) at which a given thread must access a 
data element updated by another thread (pos- 
sibly in another block). In order to ensure 
that the data updating is complete before the 
data is used, a synchronization mechanism is 
needed. Threads within the same block can 
be synchronized with the hardware provided 
barrier synchronization mechanism. No such 
mechanism is provided for synchronization be- 
tween threads in different blocks. For this rea- 
son, we have implemented the synchronization 
mechanism for threads in all blocks proposed in 
[19j . When synchronizing among blocks, there 
is a potential for deadlock situations. Deadlock 
can occur when a block is queued for process- 
ing on an SMP while another block is process- 
ing on that SMP. For this reason, we limit the 
number of blocks in a kernel to be no greater 
than the number of SMPs. The alternative of 
completing all GPU processing when synchro- 
nization is required and then re-launching the 
GPU kernel is not feasible because of the asso- 
ciated kernel launch overhead. 

(iv) The highest performance is achieved when as 
many threads as possible are executing simul- 
taneously. Using the A-matrix approach all 
the swaps costs are updated at the same place 
in the implementation, providing a straightfor- 
ward opportunity to use many threads. 



implementation, the performance improvement is 
not dependent on the nature of the instances used 
to measure performance and it is sufficient to com- 
pare performance on one family of instances. 

The non-parallel version, written in CH — h and de- 
veloped in Ref. was run on an Intel Xeon 2.4 
GHz processor. The GPU version was run on a 
Nvidia Tesla C2070 GPU with processing elements 
with clock speed of 1.15 GHz. The GPU contains 
14 SMPs each of which contains 32 processing ele- 
ments. The GPU version is written in C++ with 
Nvidia CUDA language extensions [14] . 

In Fig. 1, for various problem sizes, we plot 
run time versus iterations per run, /, for the ef- 
ficient non-parallel implementation and our paral- 
lel GPU based implementation. We note that the 
GPU based implementation becomes more efficient 
than the non-parallel implementation for / rs 10 4 . 
Until the iteration number becomes ~ 10 6 , the 
non-parallel method does not use the A-matrix ap- 
proach because the acceptance rate of swaps is high 
(see 17( where the role of acceptance rate is dis- 
cussed in detail) . Our parallel implementation uses 
the delta matrix for all values of I and therefore suf- 
fers when the acceptance rate is high as is the case 
for low values of I. This effect is also seen in Fig. 
2 in which we plot the performance improvement 



non— parallel 



parallel 



(4) 



versus I where t non - para uei and t paraUe i are the run 
times for the non-parallel and parallel implementa- 
tions respectively. For all problem sizes studied, P 
increases with increasing number of iterations until 
about I « 10 7 at which point P reaches a plateau; 
the additional time contribution from the initial 10 6 
iterations in the parallel implementation then be- 
comes negligible. 



5. Discussion and Summary 



4. Numerical Results 

We perform numerical experiments on a family of 
random QAP instances created in the same manner 
as the Taixxa instances [H| in QAPLIB Q: the A 
and B matrices are symmetric with zero diagonal; 
the matrix elements are chosen from independent 
uniform distributions. These same instances were 
analyzed in Ref. [ItJ ■ Because the GPU version im- 
plements the identical heuristic as the non-parallel 



Because 

• our parallel implementation provides perfor- 
mance improvements of up to a factor of 100 
versus the non-parallel implementation of the 
A-matrix approach, and 

• the non-parallel A-matrix approach provides a 
performance improvement of up a factor of 100 
versus the traditional implementation (l7j . 
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the current parallel implementation provides a re- 
markable performance improvement of up to 10,000 
versus the traditional implementation which has 
been in use essentially unchanged for over two 
decades. 

We emphasize that we make no change to the 
original SA heuristic to accommodate the GPU en- 
vironment — only to its implementation. Thus our 
GPU implementation of the heuristic provides ex- 
actly the same results, statistically, as does the tra- 
ditionally implemented version. 

For more details on the GPU implementation we 
refer the reader to C++/CUDA code provided as 
supplementary material. 
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Figure 1: Run time versus number of iterations per run for 
random instances for non-GPU implementation (circles) and 
GPU implementation (squares). 
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Figure 2: Performance improvement versus number of itera- 
tions per run 
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